2014-02-18
Rani Kalari
Poulami


Overarching approach:
Starting with pathways, find genes, then find all SNPs within 500 kbp of those genes
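
A minimal sketch of the window step, assuming the gene coordinates are already in a BED-like file (chrom, start, end, name) and the SNPs carry CHR and BP columns. GENE_X, its coordinates and the rs_far_away row are made up for illustration; only the two real rsIDs and positions come from the chr17 sample further down. With bedtools installed, `bedtools window -w 500000` should cover the same ground; plain awk is used here so the sketch has no dependencies.

```shell
# Toy inputs in a scratch directory. GENE_X and its coordinates are hypothetical.
work=$(mktemp -d)
printf '17\t45400000\t45450000\tGENE_X\n' > "$work/genes.bed"
cat > "$work/snps.txt" <<'EOF'
SNP CHR BP
rs55904085 17 45000234
rs189899 17 45000529
rs_far_away 17 47000000
EOF

# Keep every SNP whose position falls within WINDOW bp of any gene.
# The gene file is loaded into memory first; the SNP file is then streamed.
WINDOW=500000
hits=$(awk -v w="$WINDOW" '
  NR == FNR {
    chrom[FNR] = $1; lo[FNR] = $2 - w; hi[FNR] = $3 + w; gene[FNR] = $4
    n = FNR
    next
  }
  FNR == 1 { next }   # skip the SNP header line
  {
    for (i = 1; i <= n; i++)
      if ($2 == chrom[i] && $3 >= lo[i] && $3 <= hi[i])
        print $1, $2, $3, gene[i]
  }' "$work/genes.bed" "$work/snps.txt")
echo "$hits"
rm -rf "$work"
```

The loop over genes is quadratic, which is fine for a pathway-sized gene list but would want an interval tool for genome-wide use.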


Use a bed2seq pipe to look up the actual ref base on the reference genome
How to go to VCF?  - need ref and alt alleles
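
Before the ref base can be looked up, each SNP has to be expressed as an interval. A minimal sketch, assuming the BP column in the pvalues files (format shown further down) is a 1-based position. BED intervals are 0-based and half-open, so a SNP at BP becomes [BP-1, BP). Whether the bed2seq step wants exactly these four columns is an assumption.

```shell
# Header and two rows copied from the chr17 sample in these notes.
sample=$(mktemp)
cat > "$sample" <<'EOF'
SNP N OR lci uci pvalue CHR BP all2freq allele1 allele2
rs55904085 1420 0.942522669093175 0.305383951704525 2.90895764756508 0.91759869057581 17 45000234 0.0167701666666667 C T
rs189899 1420 0.996519914197285 0.799034173664731 1.24281535398815 0.975321522258152 17 45000529 0.634500233333333 C G
EOF

# Emit chrom, start (BP-1), end (BP), name; skip the header line.
# ASSUMPTION: BP is 1-based, as is usual for GWAS output.
bed=$(awk 'NR > 1 { printf "%s\t%d\t%d\t%s\n", $7, $8 - 1, $8, $1 }' "$sample")
echo "$bed"
rm -f "$sample"
```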

Info comes from GWAS platforms (omnifi? platforms)

1) Combine ALL SNPs from ALL studies into one combined set
	- Add metadata (study, model)
	- Example:
		Starting dir:  /data5/bsi/pharmacogenetics/s112548.GeparQuinto
		Subdirs:
			drwxrwsr-x 12 ecarlson biostat      5301 Feb 16 09:11 PCR_imputed
			drwxrwsr-x 12 ecarlson biostat      5152 Feb 16 20:10 PCR_inv_imputed
			drwxrwsr-x  5 ecarlson biostat      5693 Feb 11 22:54 PCR_inv_observed
			drwxrwsr-x  5 ecarlson biostat      5723 Feb 11 10:51 PCR_observed
2) Look up Ref and Alt using BED tools (bior command line for this)
3) Convert to catalog format
	- Chrom, start, end, { ... }
	- Sort and unique by chrom, start, end
	- BGZip
	- Tabix
4) Add annotations: 1000 Genomes
5) Convert to VCF
6) Put VCF into Variant-Miner to filter	
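
Steps 1 and 3 above could be sketched roughly as follows. This assumes the pvalues layout shown further down, that a catalog row is tab-delimited chrom, start, end plus a JSON column, and that start = end = BP for a SNP. The JSON key names, the output file name and passing study/model as arguments are placeholders, not settled decisions. Numeric fields are emitted unquoted, so NA values in real data would need a guard. The bgzip and tabix steps only run if both tools are on the PATH.

```shell
work=$(mktemp -d)
mkdir -p "$work/PCR_imputed"
# Header and two rows copied from the chr17 sample, deliberately out of order.
printf '%s\n' \
  'SNP N OR lci uci pvalue CHR BP all2freq allele1 allele2' \
  'rs189899 1420 0.996519914197285 0.799034173664731 1.24281535398815 0.975321522258152 17 45000529 0.634500233333333 C G' \
  'rs55904085 1420 0.942522669093175 0.305383951704525 2.90895764756508 0.91759869057581 17 45000234 0.0167701666666667 C T' \
  | gzip -c > "$work/PCR_imputed/chr17.her2neg.pvalues.txt.gz"

# to_catalog FILE STUDY MODEL (hypothetical helper):
# emits chrom <TAB> start <TAB> end <TAB> {json}, with study/model as metadata.
to_catalog() {
  gzip -dc "$1" | awk -v study="$2" -v model="$3" '
    NR == 1 { next }   # skip the header line
    {
      printf "%s\t%s\t%s\t", $7, $8, $8
      printf "{\"SNP\":\"%s\",\"N\":%s,\"OR\":%s,\"pvalue\":%s,", $1, $2, $3, $6
      printf "\"allele1\":\"%s\",\"allele2\":\"%s\",", $10, $11
      printf "\"study\":\"%s\",\"model\":\"%s\"}\n", study, model
    }'
}

# Combine (one file here), then sort and unique by chrom, start, end.
TAB=$(printf '\t')
to_catalog "$work/PCR_imputed/chr17.her2neg.pvalues.txt.gz" PCR_imputed her2neg \
  | sort -t "$TAB" -k1,1 -k2,2n -k3,3n -u > "$work/gwas.catalog.tsv"
catalog=$(cat "$work/gwas.catalog.tsv")
echo "$catalog"

# Compress and index only when the tools are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c "$work/gwas.catalog.tsv" > "$work/gwas.catalog.tsv.bgz"
  tabix -s 1 -b 2 -e 3 "$work/gwas.catalog.tsv.bgz"
fi
rm -rf "$work"
```

One thing to watch: a keyed `sort -u` keeps a single row per chrom/start/end, so once several studies are combined, rows for the same SNP from different studies would collapse into one. If each study's result should survive, the study and model would have to be merged into one JSON object per position, or the unique step dropped.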


Dan to work on building the pipelines to go into catalog and into VCF for Variant-Miner
Mike to work on taking the original files and resolving Ref/Alt from them
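
One way the Ref/Alt resolution might work once the reference base at each position is known: whichever of allele1/allele2 matches the reference becomes Ref, the other becomes Alt, and anything matching neither is flagged for review (a strand flip, for example). This is a sketch only. The SNP names and the REF column in the toy input are invented, not looked up.

```shell
# Columns: SNP allele1 allele2 REF. All rows are made-up placeholders.
toy=$(mktemp)
cat > "$toy" <<'EOF'
snp_a C T C
snp_b C G G
snp_c A G T
EOF

# Output columns: SNP ref alt
resolved=$(awk '
  $4 == $2 { print $1, $2, $3; next }   # allele1 matches the reference base
  $4 == $3 { print $1, $3, $2; next }   # allele2 matches the reference base
           { print $1, "UNRESOLVED", "UNRESOLVED" }
' "$toy")
echo "$resolved"
rm -f "$toy"
```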

**************
NOTE: What to do about the files in PCR_observed/ and PCR_inv_observed/?
	- These are files like:  chr17pvalues.RDATA (gzip'd binary files)
	- Do we include these?  Do we parse them through R first?
	


=======================
From dir:
  $ pwd
  /Users/m054457/DATA/BioR_2011/JavaProjects/Code/bior_pipeline/scripts/UserExamples/GWAS_to_catalog

Create subdirectories for example/test
  $ mkdir PCR_imputed PCR_inv_imputed PCR_inv_observed PCR_observed

Find all files on RCF that are in dirs beginning with PCR_ and match pattern "chr17*.gz"
  $ find /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_* -name "chr17*.gz"
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/tmp/chr17.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/tmp/chr17.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/tmp/chr17.int.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/tmp/chr17.int.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.her2neg.pvalues.txt.gz

Ignore those with the /tmp/ in the filepath
  $ find /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_* -name "chr17*.gz" | grep -v "/tmp/"
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.her2neg.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.pvalues.txt.gz
  /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.her2neg.pvalues
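
An alternative sketch that lets find skip the tmp directories itself instead of filtering afterwards. The directory tree here is a throwaway stand-in built under mktemp, not the real RCF mount.

```shell
# Build a small stand-in tree with one tmp/ subdirectory.
root=$(mktemp -d)
mkdir -p "$root/PCR_imputed/tmp" "$root/PCR_inv_imputed"
: > "$root/PCR_imputed/chr17.pvalues.txt.gz"
: > "$root/PCR_imputed/tmp/chr17.pvalues.txt.gz"
: > "$root/PCR_inv_imputed/chr17.int.pvalues.txt.gz"

# -prune stops find from descending into any directory named tmp;
# everything else is matched against the chr17*.gz pattern.
found=$(find "$root"/PCR_* -type d -name tmp -prune -o -type f -name 'chr17*.gz' -print | sort)
echo "$found"
rm -rf "$root"
```

The difference from `grep -v "/tmp/"` is that the grep matches anywhere in the path, so it would also drop every result if the search root itself sat under a directory called tmp.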

For chr17, grab the first 10 lines (header plus 9 data rows) of each file type for each subdir
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.her2neg.pvalues.txt.gz     | head | gzip -c > PCR_imputed/chr17.her2neg.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.her2neg.pvalues.txt.gz | head | gzip -c > PCR_imputed/chr17.int.her2neg.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.int.pvalues.txt.gz         | head | gzip -c > PCR_imputed/chr17.int.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_imputed/chr17.pvalues.txt.gz             | head | gzip -c > PCR_imputed/chr17.pvalues.txt.gz

  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.her2neg.pvalues.txt.gz | head | gzip -c > PCR_inv_imputed/chr17.her2neg.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.pvalues.txt.gz         | head | gzip -c > PCR_inv_imputed/chr17.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.pvalues.txt.gz     | head | gzip -c > PCR_inv_imputed/chr17.int.pvalues.txt.gz
  gzcat /Volumes/data5/bsi/pharmacogenetics/s112548.GeparQuinto/PCR_inv_imputed/chr17.int.her2neg.pvalues.txt.gz | head | gzip -c > PCR_inv_imputed/chr17.int.her2neg.pvalues.txt.gz
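
The eight commands above follow one pattern, so a loop along these lines could generate them. It is a sketch: SRC and DEST are placeholder variables, `gzip -dc` stands in for `gzcat` so that it also runs on Linux, and the demo builds a throwaway source tree with dummy rows instead of touching the RCF mount.

```shell
# Demo source tree; on the real system SRC would be the GeparQuinto directory.
SRC=$(mktemp -d)
DEST=$(mktemp -d)
for d in PCR_imputed PCR_inv_imputed; do
  mkdir -p "$SRC/$d"
  for f in chr17.pvalues.txt.gz chr17.int.pvalues.txt.gz; do
    # One header line plus 25 dummy rows per file.
    { echo "SNP N OR"
      awk 'BEGIN { for (i = 1; i <= 25; i++) print "rs_dummy" i, 100, 1 }'
    } | gzip -c > "$SRC/$d/$f"
  done
done

# Keep the first 10 lines (header + 9 rows) of every chr17 file,
# mirroring the subdirectory layout under DEST.
for src in "$SRC"/PCR_*imputed/chr17*.pvalues.txt.gz; do
  sub=$(basename "$(dirname "$src")")
  mkdir -p "$DEST/$sub"
  gzip -dc "$src" | head -n 10 | gzip -c > "$DEST/$sub/$(basename "$src")"
done

sampled=$(find "$DEST" -name '*.gz' | wc -l | tr -d ' ')
lines=$(gzip -dc "$DEST/PCR_imputed/chr17.pvalues.txt.gz" | wc -l | tr -d ' ')
echo "$sampled files, $lines lines each"
rm -rf "$SRC" "$DEST"
```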


These files look like:
$ gzcat PCR_imputed/chr17.her2neg.pvalues.txt.gz 
SNP N OR lci uci pvalue CHR BP all2freq allele1 allele2
rs55904085 1420 0.942522669093175 0.305383951704525 2.90895764756508 0.91759869057581 17 45000234 0.0167701666666667 C T
rs189899 1420 0.996519914197285 0.799034173664731 1.24281535398815 0.975321522258152 17 45000529 0.634500233333333 C G